This work comes from following along with the R for Data Science book. The book is freely available online at https://r4ds.had.co.nz/

Chapter 1: Introduction

# import the tidyverse library
library(tidyverse)

# update packages
# tidyverse_update()

# load other packages required for the book
library(nycflights13)

# load ggplot
library(ggplot2)

# used in chapter 7
library(ggstance)
library(lvplot)

Chapter 2: Introduction

Back to Top

We’ll skip Chapter 2 since there’s not much to write up there

Chapter 3: Data Visualization

Back to Top

3.2 First Steps

Use the mpg data frame and visualization techniques to answer this question: Do cars with big engines use more fuel than cars with small engines?

3.2.1 The mpg data frame

Back to Top

The mpg dataframe is included in the ggplot2 package. It contains data on 38 car models, collected by the US Environmental Protection Agency.

mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l~ f        18    29 p     comp~
##  2 audi         a4         1.8  1999     4 manual~ f        21    29 p     comp~
##  3 audi         a4         2    2008     4 manual~ f        20    31 p     comp~
##  4 audi         a4         2    2008     4 auto(a~ f        21    30 p     comp~
##  5 audi         a4         2.8  1999     6 auto(l~ f        16    26 p     comp~
##  6 audi         a4         2.8  1999     6 manual~ f        18    26 p     comp~
##  7 audi         a4         3.1  2008     6 auto(a~ f        18    27 p     comp~
##  8 audi         a4 quat~   1.8  1999     4 manual~ 4        18    26 p     comp~
##  9 audi         a4 quat~   1.8  1999     4 auto(l~ 4        16    25 p     comp~
## 10 audi         a4 quat~   2    2008     4 manual~ 4        20    28 p     comp~
## # ... with 224 more rows

We can use ?mpg to get more information on the dataframe.

The dataframe has 234 rows with the following 11 variables:

  1. manufacturer - manufacturer name
  2. model - model name
  3. displ - engine displacement, in litres
  4. year - year of manufacture
  5. cyl - number of cylinders
  6. trans - type of transmission
  7. drv - the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd
  8. cty - city miles per gallon
  9. hwy - highway miles per gallon
  10. fl - fuel type
  11. class - “type” of car

3.2.2 Creating a ggplot

Back to Top

Plot displ on the x-axis and hwy on the y-axis.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

Notice the negative relationship between engine size (displ) and highway fuel efficiency (hwy), indicating that cars with big engines use more fuel.

Not in book: Let’s try this with city mileage as well.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty))

Looks pretty much the same as highway mileage.

Notes on ggplot

The call to ggplot() creates a coordinate system to which we’ll add layers. We connect a dataset to the plot using the data argument. On its own, this creates an empty plot.

3.2.4 Exercises

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

Just a blank figure.

  2. How many rows are in mpg? How many columns?
# rows in mpg
print(paste("rows =", nrow(mpg)))
## [1] "rows = 234"
# columns in mpg
print(paste("columns =", ncol(mpg)))
## [1] "columns = 11"
  3. What does the drv variable describe? Read the help for ?mpg to find out.

drv describes the type of drive train in the vehicle, where
  • f = front-wheel drive
  • r = rear-wheel drive
  • 4 = 4-wheel drive
  4. Make a scatterplot of hwy vs cyl.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cyl))

  5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = class, y = drv))

This is not useful because both of these variables are categorical. We’re not gaining much information by displaying the data as a scatter plot.

3.3 Aesthetic mappings

Mapping aesthetics in your plot to variables in the data can help to illuminate interesting patterns. Let’s map the class variable to a plot of displ and hwy to see how those classes distribute.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

We’ve mapped the colors of the points to the class variable. The 2seater cars appear to be outliers, with large engines and good hwy mileage. This makes sense because these sports cars have large engines and small bodies.

We can also map variables to the size aesthetic. It’s best not to map size to a categorical variable.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.

We also have the alpha or shape aesthetics.

# Left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.

# Right
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

The shape aesthetic can only map six categories at a time; points in the remaining categories are dropped from the plot.
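Not in book: if we must map more than six categories to shape, the warning suggests specifying shapes manually. A minimal sketch using scale_shape_manual() (the seven shape codes here are an arbitrary choice):

```r
# assign an explicit plotting symbol to each of the 7 classes
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = c(0, 1, 2, 3, 7, 15, 17))
```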

You can also set the aesthetic properties manually. When the aesthetic is set manually (and not depending on a variable), it must be set OUTSIDE of the aes call.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = 'blue')

3.3.1 Exercises

  1. What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

The color is set manually, so it needs to be outside of the aes() call.

  2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

?mpg
str(mpg)

The categorical variables are:

  • manufacturer
  • model
  • year
  • cyl
  • trans
  • drv
  • cty
  • hwy
  • fl
  • class

The continuous variables are:

  • displ

When we look at the mpg dataframe, we can see the type of data in each column. chr variables are categorical and num variables are continuous, while int variables aren’t strictly either. Treating them as categorical throws away the ordering information that we have, so we might want to consider them ordinal.
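Not in book: one way to back this up is to count the distinct values per column; categorical variables tend to have few distinct values. A sketch:

```r
# number of unique values in each column of mpg
sapply(mpg, function(x) length(unique(x)))
```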

  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

Continuous Case:

# map color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ))

# map size
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = displ))

# map shape - doesn't work
# ggplot(data = mpg) +
#   geom_point(mapping = aes(x = displ, y = hwy, shape = displ))

Categorical Case:

# map color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

# map size
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = drv))
## Warning: Using size for a discrete variable is not advised.

# map shape
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv))

Shape doesn’t map with a continuous variable. Color takes on a gradient.

Size doesn’t work well with a categorical variable.

  4. What happens if you map the same variable to multiple aesthetics?
# map color and size
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ, size = displ))

We get both of the aesthetics on the same plot.

  5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

?geom_point

# set the stroke width (a fixed value, so it goes outside aes())
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), stroke = 5)

The stroke aesthetic modifies the width of the border for shapes that have one.
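Not in book: stroke is most visible with the filled shapes (21–24), which have both a fill and a border. A sketch:

```r
# shape 21 is a filled circle: fill sets the interior,
# while color and stroke control the border and its width
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             shape = 21, fill = "white", color = "black", stroke = 1.5)
```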

  6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
# map color
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

We can use expressions to set the aesthetics according to the data; here the points are colored by whether displ < 5 is true.

3.5 Facets

Facets allow us to split plots into multiple subplots. We can facet the plot by a single variable using facet_wrap(). The first argument of facet_wrap() is a formula specifying the variable to split the plot on.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

We can facet the plot on the combination of two variables as well, using facet_grid().

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

To facet in only one of the two dimensions, we can use . in place of a variable name.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, 
                           color = class)) +
  facet_grid(. ~ class)

3.5.1 Exercises

  1. What happens when you facet on a continuous variable?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cty)

Faceting on a continuous variable creates a facet for every unique value of that variable, resulting in a terrible graph.

  2. What do the empty cells in the plot with facet_grid(drv ~ cyl) mean?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

The empty plots are facets with no corresponding data. For instance, there are no rear-wheel-drive cars with 4 or 5 cylinders (cyl).

How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

We see that we’re missing points at each of the variable combinations that aren’t present in our dataset.

  3. What plots does the following code make? What does . do?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

We create facets using just one variable. The . acts as a placeholder so that the plot is not faceted in that dimension.

  4. Take the first faceted plot in this section:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

We can look at the color aesthetic here to compare.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

One thing we can see immediately is that overlapping points make it more difficult to focus on a single group when we use the color aesthetic compared to faceting. If we’re interested in comparing points and looking for patterns within groups, then faceting is more useful here.

A disadvantage of using facets is that we lose some ability to compare across groups. With the color aesthetic, we immediately see how groups are mixed into one another, but this is less obvious with facets.

The balance also depends on how many groups we’re comparing. If we have too many groups, facets quickly become overwhelming to look at. In larger datasets with fewer groups, we might prefer to separate groups using facets so that each panel is less dense.

  5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

The facet_wrap function wraps a sequence of panels into a 2-dimensional figure. The nrow and ncol arguments allow you to define the number of rows and columns for the wrapped figure.

The as.table option controls the order of the panels in the output. If as.table is TRUE (the default), the highest values are at the bottom-right; otherwise, the highest values are at the top-right.

The dir option specifies whether the direction should be “h” for horizontal (default) or “v” for vertical.
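Not in book: a quick sketch combining these layout options:

```r
# fill panels down columns first (dir = "v") instead of across rows
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4, dir = "v")
```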

facet_grid() does not have nrow or ncol arguments because we pass in the rows and cols arguments to define the groups that create the rows and columns of the grid.

  6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

This creates a wide landscape figure rather than a tall portrait figure.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(year ~ cyl)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(cyl ~ year)

The graphs are clearer and easier to read when we put the variable with more levels into the columns.

3.6 Geometric Objects

Plots can represent the same data with different visual objects. These are referred to as geoms. Take the following two graphs for example.

# scatter plot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# line plot
ggplot(data=mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Every geom in ggplot2 takes a mapping argument, however not every aesthetic can work with every geom. For instance, you can set the shape of a point, but not of a line.

Linetype can be set for a line.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The drv variable describes the car’s drivetrain, which includes 4-wheel-drive, front-wheel-drive, and rear-wheel-drive. We’ve set the aesthetic to separate the data according to drivetrain, and the linetype reflects each group.

We can see this even more clearly by overlapping multiple geoms on one plot.

ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(linetype = drv, color = drv)) +
  geom_point(mapping = aes(color = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

For an overview of the different geoms available in ggplot2, see the ggplot2 cheat sheet: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Geoms like geom_smooth() use a single object to represent the data. We can set the group aesthetic to split the object into multiple groups. The convenient thing about the group feature is that a legend is not added automatically to our plot.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can also manually remove the legend from aesthetics that do generate one, such as the color aesthetic.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, 
                            group = drv, color = drv),
                            show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

To include multiple geoms in the same plot, we just add them to the same ggplot() function.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

By writing the code like this, we duplicate the \(x\) and \(y\) axis assignments. This could make things difficult for updating these axes to different variables. Instead, we can add these assignments to the ggplot() function where they serve as global variables.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can still extend or overwrite the global mappings by placing them directly in the geom function.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can use this to specify different data for the geoms as well. Here, we’ll subset the smooth line to just the subcompact cars.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == 'subcompact'), 
              se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3.6.1 Exercises

  1. What geom would you use to draw a
  • line chart? geom_line()
  • boxplot? geom_boxplot()
  • histogram? geom_histogram()
  • area chart? geom_area()
  2. Run this code in your head and predict what the output will look like. Then run the code in R and check your prediction.

Prediction: a scatter plot of hwy against displ, colored by drv, with smooth lines also separated by drv and without standard error bands.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

The show.legend flag indicates whether a legend should be included in the plot. It was probably used earlier because the legend only pertained to one of the plots.

  4. What does the se argument to geom_smooth() do?

The se argument indicates whether the standard error region associated with the smooth line should be displayed.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  5. Will these two graphs look different? Why/Why not?

Prediction: No, because the first plot simply sets global aesthetics, while the second plot sets the same aesthetics locally.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  6. Recreate the R code necessary to generate the following graphs.

Top left graph:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size=4) +
  geom_smooth(se = FALSE, size=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Top right plot:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size=4) +
  geom_smooth(se = FALSE, size=2, mapping = aes(group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Middle left plot:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point(size=4) +
  geom_smooth(se = FALSE, size=2, mapping = aes(group = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Middle right plot:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size=4, mapping = aes(color = drv)) +
  geom_smooth(se = FALSE, size=2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Bottom left plot:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size=4, mapping = aes(color = drv)) +
  geom_smooth(se = FALSE, size=2, mapping = aes(linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Bottom right plot:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size=6, color = 'gray97') +
  geom_point(size=3, mapping = aes(color = drv))

3.7 Statistical transformations

Bar charts are a useful way to look at data. Here’s a bar chart using the diamonds dataset, grouped by cut. This dataset contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. We can see from the plot that more diamonds are available with high quality cuts than with low quality cuts.

names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x=cut))

The \(y\)-axis variable, count, is not in the diamonds dataset. Some graphs compute new values to plot:

  • bar charts, histograms, and frequency polygons bin your data and then plot bin counts
  • smoothers fit a model to your data and then plot predictions from the model
  • boxplots compute summary statistics about the distribution and then display the formatted box

The algorithm used to produce the transformation is called a stat (i.e. statistical transformation). You can get more info about what stat a geom uses by inspecting its default stat argument. For example, ?geom_bar shows us that the stat is “count”, which means that it uses the stat_count() function.

You can generally use stats and geoms interchangeably. This produces the same graph as before:

ggplot(data=diamonds) +
  stat_count(mapping = aes(x=cut))

Every geom has a default stat and every stat has a default geom.

There are three reasons why you might want to specify a stat explicitly:

  1. You want to override the default stat. Here we want to create a bar chart where actual data values are represented rather than counts.
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

  2. You might want to override the default mapping from transformed variables to aesthetics. Here we display a bar chart for proportions rather than counts.
ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, y=stat(prop), group=1))

  3. You might want to emphasize the statistical transformation in your code. Here we use stat_summary() to summarize the y values for each unique x value to draw attention to the summary that we’re computing.
ggplot(data=diamonds) +
  stat_summary(
    mapping = aes(x=cut, y=depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
## Warning: `fun.y` is deprecated. Use `fun` instead.
## Warning: `fun.ymin` is deprecated. Use `fun.min` instead.
## Warning: `fun.ymax` is deprecated. Use `fun.max` instead.

3.7.1 Exercises

  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

?stat_summary()

The default geom associated with stat_summary() is the pointrange geom.

?geom_pointrange()

# create a df with the values first
diamonds_summary <- diamonds %>%
  group_by(cut) %>%
  summarize(lower = min(depth), med = median(depth), upper = max(depth))

ggplot(data=diamonds_summary) +
  geom_pointrange(mapping = aes(x=cut, y=med, ymax=upper, ymin=lower)
  ) +
  ylab("depth")

  2. What does geom_col() do? How is it different to geom_bar()?

?geom_col

Both geom_col and geom_bar are geoms that create bar charts. With geom_col, the bar heights represent values in the data: we specify variables for both the x and y axes, and the bar height is the y-axis variable’s value for each category on the x-axis.

With geom_bar, the bar height is proportional to the number of cases in each group: we specify only an x-axis variable, and the bar heights are computed as the count for each category.

We can look at examples of the two here.

# geom_bar: bar height is the count in each group
ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut))

# geom_col: bar height comes from a variable in the data
ggplot(data=diamonds) +
  geom_col(mapping=aes(x=cut, y=depth))

  3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

See the ggplot2 cheat sheet: https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

  4. What variables does stat_smooth() compute? What parameters control its behaviour?

The stat_smooth() function adds a smooth line to the data points, letting us view patterns in the data. It computes the predicted value (y), the lower and upper confidence bounds (ymin, ymax), and the standard error (se).

  • The method argument lets us choose how the smooth line is generated, with smoothing methods including “lm”, “glm”, “gam”, “loess”, or a function.
  • The formula argument lets you specify a smoothing formula such as y ~ x.
  • se is a boolean indicating whether to display the standard error band.

ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Width, color=Species)) +
  geom_point() +
  geom_smooth(se=TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  5. In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group=1))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop.., group=1))

Without the group argument, the geom groups the data according to the x-axis variable cut. Since we are interested in proportions, we need to consider all of the values together, so we place them in a single group. Notice how the fill = color mapping no longer distinguishes the bars once we add the group argument.

3.8 Position adjustments

We can adjust colors in bar charts using either the color or fill aesthetic.

ggplot(data=diamonds) + 
  geom_bar(mapping = aes(x=cut, color=cut))

ggplot(data=diamonds) + 
  geom_bar(mapping = aes(x=cut, fill=cut))

We can even map the color to another variable. The bars will reflect the data and the colors will be stacked to represent the aesthetic variable.

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity))

The stacking is done automatically by the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options:

  • "identity"
  • "dodge"
  • "fill"

The "identity" option places each object exactly where it falls in the context of the graph. For bar charts, this simply overlaps the bars. To see the overlapping bars, we need to adjust one of the other aesthetics, like alpha.

# set the transparency of the fill using alpha
ggplot(data=diamonds, mapping=aes(x=cut, fill=clarity)) +
  geom_bar(alpha=1/5, position="identity")

# remove the fill by setting fill to NA
ggplot(data=diamonds, mapping=aes(x=cut, color=clarity)) +
  geom_bar(fill=NA, position="identity")

The "fill" option works like stacking but makes each stack of bars the same height. This makes comparing proportions easier.

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity), position="fill")

The "dodge" option places overlapping objects beside one another. This makes for easier comparisons between individual values.

ggplot(data=diamonds) +
  geom_bar(mapping = aes(x=cut, fill=clarity), position="dodge")

One more option that is not useful for bar charts can be used with scatterplots: the "jitter" position, which lets us see overlapping points on a scatterplot. Notice in the following plot that we only see 126 points even though there are 234 observations.

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy))

We can avoid this with the "jitter" option, which adds a small amount of random noise to the points on the plot.

ggplot(data=mpg) +
  geom_point(mapping = aes(x=displ, y=hwy), position="jitter")

This can really help display the data on a plot. Another way to jitter points on a scatter plot is to use the geom_jitter() geom.
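Not in book: position = "jitter" is shorthand for position_jitter() with default settings, so these two calls are equivalent:

```r
# the geom_jitter() shortcut...
ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

# ...is the same as geom_point() with a jitter position adjustment
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = position_jitter())
```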

3.8.1 Exercises

  1. What is the problem with this plot? How could you improve it?
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_point()

Many points overlap exactly, hiding part of the data. We might want to jitter the points to reveal them.

ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_jitter()

Now we can see much more data on the graph.

  2. What parameters to geom_jitter() control the amount of jittering?

The width and height parameters control the amount of horizontal and vertical jitter. The default is 40% of the resolution of the data in each direction. We can play around with these settings.

# more horizontal jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_jitter(width=0.7)

# more vertical jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_jitter(height=0.7)

# less horizontal jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_jitter(width=0.1)

# less vertical jitter
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_jitter(height=0.1)

  3. Compare and contrast geom_jitter() with geom_count().
ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_jitter()

ggplot(data=mpg, mapping = aes(x=cty, y=hwy)) +
  geom_count()

The two geoms are quite similar. With geom_jitter we judge how heavily an area is represented by the density of points, while with geom_count we look at the size of the points instead. The benefit of geom_count is that the points stay in their original positions, so we get a more accurate representation of the data; however, the true number of points is still somewhat hidden.

  4. What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.

The default position of geom_boxplot is “dodge2”. We can look at the class variable colored by drv to view this.

ggplot(data=mpg) +
  geom_boxplot(mapping = aes(x=class, y=hwy, color=drv))

We can try playing around with other positions as well.

ggplot(data=mpg) +
  geom_boxplot(mapping = aes(x=class, y=hwy, fill=drv), alpha=0.2, position="identity")

3.9 Coordinate Systems

The default coordinate system is the Cartesian coordinate system where the \(x\) and \(y\) coordinates act independently to determine the location of each point. There are other coordinate systems that we can use.

coord_flip() switches the \(x\) and \(y\) axes. This comes in handy for making things like horizontal box plots.

# vertical boxplots
ggplot(data = mpg, mapping=aes(x=class, y=hwy, fill=class)) +
  geom_boxplot(show.legend = FALSE)

# horizontal boxplots
ggplot(data=mpg, mapping=aes(x=class, y=hwy, fill=class)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip()

coord_quickmap() sets the aspect ratio correctly for maps. This is important for plotting spatial data.

map <- map_data("world")

ggplot(map, aes(long, lat, group=group, fill=region)) +
  geom_polygon(color="black", show.legend=FALSE)

ggplot(map, aes(long, lat, group=group, fill=region)) +
  geom_polygon(color="black", show.legend=FALSE) +
  coord_quickmap()

coord_polar() uses polar coordinates. This can reveal an interesting connection between a bar chart and a Coxcomb chart.

bar <- ggplot(data=diamonds) +
  geom_bar(aes(x=cut, fill=cut),
           show.legend = FALSE,
           width = 1) +
  theme(aspect.ratio=1) +
  labs(x=NULL, y=NULL)

bar + coord_flip()

bar + coord_polar()

3.9.1 Exercises

  1. Turn a stacked bar chart into a pie chart using coord_polar().
# without polar coordinates
ggplot(data=diamonds, aes(x = factor(1), fill=cut)) +
  geom_bar(width=1)

# with polar coordinates
ggplot(data=diamonds, aes(x = factor(1), fill=cut)) +
  geom_bar(width=1)  +
  coord_polar(theta="y") +
  labs(x=NULL, y=NULL)

  2. What does labs() do? Read the documentation.

labs() allows you to add labels to your plot. You can add a title, subtitle, caption, tag, and update the x and y axes’ labels.

ggplot(data=diamonds, aes(x = factor(1), fill=cut)) +
  geom_bar(width=1)  +
  coord_polar(theta="y") +
  labs(x=NULL, y=NULL,
       title="Proportion of Diamond Cuts",
       subtitle="A pie chart",
       caption="This pie chart shows us the proportion of diamonds with different cuts in the diamonds dataset. 
       Most diamonds are cut with the Ideal cut, followed by Premium.",
       tag="diamonds dataset"
       )

  3. What’s the difference between coord_quickmap() and coord_map()?

coord_map() projects a portion of the earth onto a flat 2D plane and does not typically preserve straight lines. It requires a lot of computation.

coord_quickmap() approximates the projection and does preserve straight lines. It’s better to use for smaller areas closer to the equator.
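A minimal side-by-side sketch of the difference, borrowing the book's New Zealand example (this assumes the maps package is installed for map_data(), and mapproj for coord_map()):

```r
library(ggplot2)

nz <- map_data("nz")

# coord_quickmap(): a fast approximation of the correct aspect ratio
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()

# coord_map(): a true projection (Mercator by default); noticeably slower
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_map()
```

For a small region like New Zealand the two plots look nearly identical, which is why coord_quickmap() is usually good enough.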

  4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()

This plot tells us that there is a positive relationship between the city and highway miles a car can drive per gallon of gas. Since the points all lie above the line, we see that a car always gets more highway miles per gallon than city miles.

geom_abline() plots a diagonal line; with no arguments it uses an intercept of 0 and a slope of 1, which here is the line where city miles equal highway miles. coord_fixed() is important because it keeps one unit on the x axis the same length as one unit on the y axis, so that reference line is drawn at 45 degrees and the comparison isn’t distorted.

3.10 The layered grammar of graphics

Now we can develop a code template that uses all of the techniques we’ve learned to generate a plot with ggplot.

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

The seven parameters for this template compose the grammar of graphics, a formal system for building plots. This grammar is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
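As a concrete instance of the template, here is the chapter's displ vs hwy scatterplot with every slot filled in explicitly; the stat and position given are simply the defaults that geom_point() would use anyway.

```r
library(ggplot2)

# every slot of the template filled in; "identity" stat and position
# are geom_point()'s defaults, written out here for illustration
ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    stat = "identity",
    position = "identity"
  ) +
  coord_cartesian() +
  facet_wrap(~ class)
```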

Chapter 4 Workflow: basics

This chapter is pretty basic stuff so I’m going to skip it. See the R For Data Science book for the details: https://r4ds.had.co.nz/workflow-basics.html.

These are the topics covered in this chapter: - 4.1 coding basics - 4.2 variable naming conventions - 4.3 functions

Chapter 5 Data transformations

5.1 Introduction

5.1.1 Prerequisites

We’ll be using the nycflights13 package.

library(nycflights13)
library(tidyverse)

5.1.2 nycflights13

We’ll be using the flights dataframe from the nycflights13 package. This dataframe contains all 336,776 flights that departed from New York City in 2013. This data frame comes from the Bureau of Transportation Statistics.

head(flights)
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

The data is actually in tibble form. A tibble is a dataframe that is slightly tweaked to work better in the tidyverse. More details on tibbles will be in the wrangle section.
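One small tweak worth a quick illustration (my example, not from this point in the book): tibble() builds its columns sequentially, so later columns can refer to earlier ones, which data.frame() does not allow.

```r
library(tibble)

# y is defined in terms of the x column created just before it;
# the same call with data.frame() would fail with "object 'x' not found"
tb <- tibble(x = 1:5, y = x^2)
tb
```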

5.1.3 dplyr basics

This chapter will introduce the five key dplyr functions that will allow us to solve the vast majority of data manipulation challenges.

  • Pick observations by their values: filter()
  • Reorder the rows: arrange()
  • Pick variables by their names: select()
  • Create new variables with functions of existing variables: mutate()
  • Collapse many values down to a single summary: summarise()

These can all be used with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
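A quick sketch of what that change of scope looks like, using the flights data introduced above:

```r
library(dplyr)
library(nycflights13)

# ungrouped: summarise() collapses the whole data frame to a single row
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# grouped: the same call now returns one row per month
by_month <- group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
```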

All verbs work similarly:

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
  3. The result is a new data frame.

5.2 Filter rows with filter()

filter() allows you to subset observations based on their values. The first argument is the dataframe. The second and subsequent arguments are the expressions that filter the dataframe.

For example, we can select all the flights on January 1st with:

filter(flights, month==1, day==1)
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We can store the filtered dataframe in a variable.

jan1 <- filter(flights, month==1, day==1)
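A handy shortcut, sketched here as an aside: wrapping the assignment in parentheses makes R both assign the result and print it in one step.

```r
library(dplyr)
library(nycflights13)

# the surrounding parentheses store the result in jan1 AND print it
(jan1 <- filter(flights, month == 1, day == 1))
```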

5.2.2 Logical operators

We can use logical operators with filter() to build more complex conditions. Say we wanted to get all flights in January or December. We could do

filter(flights, month==1 | month==12)
## # A tibble: 55,139 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 55,129 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

A useful short-hand for this is x %in% y. This allows us to select every row where x is one of the values in y. We can rewrite the code above as

jan_dec <- filter(flights, month %in% c(1, 12))

5.2.3 Missing Values

filter() only includes rows where the condition is true. If you want to keep NA values then ask for them explicitly.

# create a tibble
df <- tibble(x = c(1, NA, 3))
# doesn't return NA
filter(df, x>1)
## # A tibble: 1 x 1
##       x
##   <dbl>
## 1     3
# returns NA rows
filter(df, is.na(x) | x>1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

5.2.4 Exercises

  1. Find all flights that:
      1. Had an arrival delay of two or more hours
# delays are measured in minutes, so two or more hours means arr_delay >= 120
filter(flights, arr_delay >= 120)
      2. Flew to Houston (IAH or HOU)
filter(flights, dest=='HOU' | dest=='IAH')
## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
      3. Were operated by United, American, or Delta
# find out abbreviations for airlines
?flights
# airlines are stored in the airlines tibble
airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.
# filter data frame with UA, AA, DL
filter(flights, carrier %in% c('UA', 'AA', 'DL'))
## # A tibble: 139,504 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # ... with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
      4. Departed in summer (July, August, and September)
filter(flights, month %in% c(7,8,9))
## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
      5. Arrived more than two hours late, but didn’t leave late
# more than two hours late means arr_delay > 120 (delays are in minutes)
filter(flights, dep_delay <= 0, arr_delay > 120)
      6. Were delayed by at least an hour, but made up over 30 minutes in flight.
# at least an hour late means dep_delay >= 60; time made up is dep_delay - arr_delay
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
      7. Departed between midnight and 6am (inclusive)
# dep_time records midnight as 2400, so include it explicitly
filter(flights, dep_time == 2400 | dep_time <= 600)
  2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

between(x, left, right) is a shortcut for x >= left & x <= right. We can definitely shorten part 4: departed in summer (between July and September).

# challenge 4
filter(flights, between(month, 7, 9))
## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The rest are already about as simple as they can be.

  3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
(no_dep <- filter(flights, is.na(dep_time)))
## # A tibble: 8,255 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     2       NA           1540        NA       NA           1747
##  6  2013     1     2       NA           1620        NA       NA           1746
##  7  2013     1     2       NA           1355        NA       NA           1459
##  8  2013     1     2       NA           1420        NA       NA           1644
##  9  2013     1     2       NA           1321        NA       NA           1536
## 10  2013     1     2       NA           1545        NA       NA           1910
## # ... with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

There are 8,255 flights missing dep_time. These rows are also missing dep_delay, arr_time, arr_delay, and air_time. They are likely flights that were cancelled and never took off.

  4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA^0 # = 1
## [1] 1
NA | TRUE # TRUE
## [1] TRUE
FALSE & NA # FALSE
## [1] FALSE
NA * 0 # NA
## [1] NA

We get a definite answer for the first three because the result is the same no matter what value the NA stands for. In NA ^ 0, even if the NA were infinity, the answer would still be 1. In the second statement, NA | TRUE, only one side needs to be true for the whole statement to be true, so the value of NA doesn’t matter. In the third statement, FALSE & NA, both sides need to be true and one side is already FALSE, so the statement is false regardless of the NA. Finally, NA * 0 stays missing because, although zero times a finite number is 0, if the NA were infinity the answer would not be zero:

Inf * 0
## [1] NaN

This means that we cannot get a definite answer for NA * 0 without knowing the value behind the NA.
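The same rule can be checked with a few more base R expressions (a small sketch beyond the book's examples):

```r
# an operation involving NA returns a concrete value only when that value
# would be the same for every possible input; otherwise it stays NA
NA & FALSE  # FALSE: an AND with a FALSE side is always FALSE
NA & TRUE   # NA: the result depends on what NA stands for
NA | FALSE  # NA: the result depends on what NA stands for
NA > 5      # NA: could be TRUE or FALSE depending on the value
```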

5.3 Arrange rows with arrange()

arrange() works similarly to filter() except it changes the order of the rows. It takes a data frame and a set of column names to order by. If you provide more than one column name, each additional column will be used to break ties in the values of the preceding column.

arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Use desc() to re-order a column in descending order.

arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end.

df <- tibble(x = c(5,2,NA))
arrange(df, x)
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA
arrange(df, desc(x))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     5
## 2     2
## 3    NA

5.3.1 Exercises

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na())
df <- tibble(x = c(1,NA,2,3,NA,4,5,NA))

# sort in descending order, with is.na()
arrange(df, desc(is.na(x)))
## # A tibble: 8 x 1
##       x
##   <dbl>
## 1    NA
## 2    NA
## 3    NA
## 4     1
## 5     2
## 6     3
## 7     4
## 8     5
  2. Sort flights to find the most delayed flights. Find the flights that left earliest.
# sort the departure delays in descending order for the most delayed flights
arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# sort departure delays in ascending order for the flights that left earliest
arrange(flights, dep_delay)
  3. Sort flights to find the fastest (highest speed) flights.
# highest speed means greatest distance traveled per minute in the air,
# so sort distance/air_time in descending order
arrange(flights, desc(distance/air_time))
  4. Which flights travelled the farthest? Which travelled the shortest?
# sort by distance for shortest flights
arrange(flights, distance)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7    27       NA            106        NA       NA            245
##  2  2013     1     3     2127           2129        -2     2222           2224
##  3  2013     1     4     1240           1200        40     1333           1306
##  4  2013     1     4     1829           1615       134     1937           1721
##  5  2013     1     4     2128           2129        -1     2218           2224
##  6  2013     1     5     1155           1200        -5     1241           1306
##  7  2013     1     6     2125           2129        -4     2224           2224
##  8  2013     1     7     2124           2129        -5     2212           2224
##  9  2013     1     8     2127           2130        -3     2304           2225
## 10  2013     1     9     2126           2129        -3     2217           2224
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# sort by descending distance for farthest flights
arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      857            900        -3     1516           1530
##  2  2013     1     2      909            900         9     1525           1530
##  3  2013     1     3      914            900        14     1504           1530
##  4  2013     1     4      900            900         0     1516           1530
##  5  2013     1     5      858            900        -2     1519           1530
##  6  2013     1     6     1019            900        79     1558           1530
##  7  2013     1     7     1042            900       102     1620           1530
##  8  2013     1     8      901            900         1     1504           1530
##  9  2013     1     9      641            900      1301     1242           1530
## 10  2013     1    10      859            900        -1     1449           1530
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

5.4 Select columns with select()

select() allows you to zoom in on a useful subset of columns using operations based on the names of the variables.

# select columns by names
dplyr::select(flights, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# select all columns between year and day
dplyr::select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# select all columns except for those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ... with 336,766 more rows, and 9 more variables: flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

There are some other useful helper functions that you can use with select():

  • starts_with("abc") matches names that begin with “abc”
  • ends_with("xyz") matches names that end with “xyz”
  • contains("ijk") matches names that contain “ijk”
  • matches("(.)\\1") selects variables that match a regular expression. This one matches any variables that contain repeated characters.
  • num_range("x", 1:3) matches x1, x2, x3

See ?select for more details.
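Helpers can also be combined in a single select() call; a quick sketch using flights:

```r
# columns starting with "sched", plus any remaining columns containing "dep";
# a column matched by several helpers (sched_dep_time) appears only once
select(flights, starts_with("sched"), contains("dep"))
```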

select() can also be used to rename variables, but is rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), a variant of select() that keeps all the variables that aren’t explicitly mentioned.

rename(flights, date=day)
## # A tibble: 336,776 x 19
##     year month  date dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Another option is to use select() with everything(). This is useful if you have a handful of variables that you’d like to move to the start of the data frame.

select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

5.4.1 Exercises

  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
# using select()
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows
# using select() with helper function starts_with()
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows
# use select() with ends_with() and starts_with()
select(flights, ends_with("delay"), ends_with("time"), -starts_with(c("sched", "air")))
## # A tibble: 336,776 x 4
##    dep_delay arr_delay dep_time arr_time
##        <dbl>     <dbl>    <int>    <int>
##  1         2        11      517      830
##  2         4        20      533      850
##  3         2        33      542      923
##  4        -1       -18      544     1004
##  5        -6       -25      554      812
##  6        -4        12      554      740
##  7        -5        19      555      913
##  8        -3       -14      557      709
##  9        -3        -8      557      838
## 10        -2         8      558      753
## # ... with 336,766 more rows
  2. What happens if you include the name of a variable multiple times in a select() call?
select(flights, year, year)
## # A tibble: 336,776 x 1
##     year
##    <int>
##  1  2013
##  2  2013
##  3  2013
##  4  2013
##  5  2013
##  6  2013
##  7  2013
##  8  2013
##  9  2013
## 10  2013
## # ... with 336,766 more rows

If you include a variable multiple times in a select() call, it is only selected once.
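This deduplication is also why the everything() pattern shown earlier doesn’t repeat columns:

```r
# year is named explicitly and matched again by everything(),
# but appears only once, as the first column
select(flights, year, everything())
```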

  3. What does the any_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")

The any_of() function selects the columns whose names appear in a character vector. Unlike all_of(), any_of() will not throw an error if the vector contains names that are not found as columns in the data frame.

select(flights, any_of(vars))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ... with 336,766 more rows
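To see the contrast directly, here is a sketch with a vector containing a made-up name ("not_a_column" is purely illustrative):

```r
vars2 <- c("year", "month", "not_a_column")

# any_of() quietly selects the names that do exist
select(flights, any_of(vars2))

# all_of() would stop with an error because "not_a_column" is missing:
# select(flights, all_of(vars2))
```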
  4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change the default?
select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # ... with 336,766 more rows

The helpers ignore case by default. To change this behaviour we can set the ignore.case argument to FALSE.

select(flights, contains("TIME", ignore.case=FALSE))
## # A tibble: 336,776 x 0

5.5 Add new variables with mutate()

We can add new columns that are functions of existing columns using mutate(). This will add new columns at the end of the dataset.

# create a dataframe with fewer columns
flights_sml <- select(flights, year:day, ends_with('delay'), distance, air_time)

# add new columns with mutate()
mutate(flights_sml, 
       gain = dep_delay - arr_delay,
       speed = distance / air_time * 60)
## # A tibble: 336,776 x 9
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227    -9  370.
##  2  2013     1     1         4        20     1416      227   -16  374.
##  3  2013     1     1         2        33     1089      160   -31  408.
##  4  2013     1     1        -1       -18     1576      183    17  517.
##  5  2013     1     1        -6       -25      762      116    19  394.
##  6  2013     1     1        -4        12      719      150   -16  288.
##  7  2013     1     1        -5        19     1065      158   -24  404.
##  8  2013     1     1        -3       -14      229       53    11  259.
##  9  2013     1     1        -3        -8      944      140     5  405.
## 10  2013     1     1        -2         8      733      138   -10  319.
## # ... with 336,766 more rows

Note that you can refer to columns that you just created.

mutate(flights_sml,
       gain = dep_delay - arr_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
## # A tibble: 336,776 x 10
##     year month   day dep_delay arr_delay distance air_time  gain hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227    -9 3.78 
##  2  2013     1     1         4        20     1416      227   -16 3.78 
##  3  2013     1     1         2        33     1089      160   -31 2.67 
##  4  2013     1     1        -1       -18     1576      183    17 3.05 
##  5  2013     1     1        -6       -25      762      116    19 1.93 
##  6  2013     1     1        -4        12      719      150   -16 2.5  
##  7  2013     1     1        -5        19     1065      158   -24 2.63 
##  8  2013     1     1        -3       -14      229       53    11 0.883
##  9  2013     1     1        -3        -8      944      140     5 2.33 
## 10  2013     1     1        -2         8      733      138   -10 2.3  
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

If you only want to keep the new variables, use transmute().

transmute(flights, 
          gain = dep_delay - arr_delay,
          hours = air_time / 60,
          gain_per_hour = gain / hours)
## # A tibble: 336,776 x 3
##     gain hours gain_per_hour
##    <dbl> <dbl>         <dbl>
##  1    -9 3.78          -2.38
##  2   -16 3.78          -4.23
##  3   -31 2.67         -11.6 
##  4    17 3.05           5.57
##  5    19 1.93           9.83
##  6   -16 2.5           -6.4 
##  7   -24 2.63          -9.11
##  8    11 0.883         12.5 
##  9     5 2.33           2.14
## 10   -10 2.3           -4.35
## # ... with 336,766 more rows

5.5.1 Useful creation functions

There are many functions you might use with the mutate() function. Here are some useful ones.

  • arithmetic operators: +, -, *, /, ^
  • modular arithmetic: %/% (integer division) and %% (remainder)
  • logs: log(), log2(), log10()
  • offsets: lead() and lag() allow you to refer to the leading or lagging values.
(x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA
# use this to compute running differences
x - lag(x)
##  [1] NA  1  1  1  1  1  1  1  1  1
# find when value changes
x != lag(x)
##  [1]   NA TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  • cumulative and rolling aggregates: cumsum() (cumulative sum), cumprod() (cumulative product), cummin() (cumulative min), cummax() (cumulative max), cummean() (cumulative mean).
x
##  [1]  1  2  3  4  5  6  7  8  9 10
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
cumprod(x)
##  [1]       1       2       6      24     120     720    5040   40320  362880
## [10] 3628800
cummin(x)
##  [1] 1 1 1 1 1 1 1 1 1 1
cummax(x)
##  [1]  1  2  3  4  5  6  7  8  9 10
cummean(x)
##  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
  • logical comparisons: <, <=, >, >=, !=, ==
  • ranking
y <- c(1,2,2,NA,3,4)

# rank in order
min_rank(y)
## [1]  1  2  2 NA  4  5
# rank in reverse order
min_rank(desc(y))
## [1]  5  3  3 NA  2  1

Here are some variants of the standard ranks.

row_number(y)
## [1]  1  2  3 NA  4  5
dense_rank(y)
## [1]  1  2  2 NA  3  4
percent_rank(y)
## [1] 0.00 0.25 0.25   NA 0.75 1.00
cume_dist(y)
## [1] 0.2 0.6 0.6  NA 0.8 1.0

5.5.2 Exercises

  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
# make a smaller data frame
flights_sml <- select(flights, dep_time, sched_dep_time)

# multiply hours by 60 and add to minutes
# time is formatted as HHMM or HMM
mutate(flights_sml,
       new_dep_time = dep_time %/% 100 * 60 + dep_time %% 100,
       new_sched_dep_time = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100
       )
## # A tibble: 336,776 x 4
##    dep_time sched_dep_time new_dep_time new_sched_dep_time
##       <int>          <int>        <dbl>              <dbl>
##  1      517            515          317                315
##  2      533            529          333                329
##  3      542            540          342                340
##  4      544            545          344                345
##  5      554            600          354                360
##  6      554            558          354                358
##  7      555            600          355                360
##  8      557            600          357                360
##  9      557            600          357                360
## 10      558            600          358                360
## # ... with 336,766 more rows
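One edge case worth noting: a dep_time of 2400 (midnight) converts to 1440 rather than 0 with the formula above. Taking the result modulo 1440 wraps it back around (to_minutes is an illustrative helper, not part of the original answer):

```r
to_minutes <- function(t) (t %/% 100 * 60 + t %% 100) %% 1440

to_minutes(2400)
## [1] 0
to_minutes(517)
## [1] 317
```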
  2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

I would expect to see the duration of the flight.

flights_sml <- select(flights, arr_time, dep_time, air_time)

mutate(flights_sml,
       duration = arr_time - dep_time)
## # A tibble: 336,776 x 4
##    arr_time dep_time air_time duration
##       <int>    <int>    <dbl>    <int>
##  1      830      517      227      313
##  2      850      533      227      317
##  3      923      542      160      381
##  4     1004      544      183      460
##  5      812      554      116      258
##  6      740      554      150      186
##  7      913      555      158      358
##  8      709      557       53      152
##  9      838      557      140      281
## 10      753      558      138      195
## # ... with 336,766 more rows

The computed duration doesn’t match the air_time in the table. Simple subtraction doesn’t work here because the times are stored in HHMM format, so the arithmetic mixes hours and minutes; converting to minutes since midnight, as in the previous exercise, fixes that. These times are also recorded in the local time zone, so we would need to account for time zone differences to get an accurate flight duration.
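A sketch of that conversion; the %% 1440 wrap handles flights that land after midnight, though time zone differences would still need separate handling:

```r
# convert HHMM times to minutes since midnight before subtracting
mutate(flights_sml,
       dep_min = dep_time %/% 100 * 60 + dep_time %% 100,
       arr_min = arr_time %/% 100 * 60 + arr_time %% 100,
       duration = (arr_min - dep_min) %% 1440)
```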

  3. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
select(flights, dep_time, sched_dep_time, dep_delay)
## # A tibble: 336,776 x 3
##    dep_time sched_dep_time dep_delay
##       <int>          <int>     <dbl>
##  1      517            515         2
##  2      533            529         4
##  3      542            540         2
##  4      544            545        -1
##  5      554            600        -6
##  6      554            558        -4
##  7      555            600        -5
##  8      557            600        -3
##  9      557            600        -3
## 10      558            600        -2
## # ... with 336,766 more rows

I would expect that dep_delay is the result of subtracting sched_dep_time from dep_time. We can add a column to verify this.

mutate(select(flights, dep_time, sched_dep_time, dep_delay), 
       difference = dep_time - sched_dep_time)
## # A tibble: 336,776 x 4
##    dep_time sched_dep_time dep_delay difference
##       <int>          <int>     <dbl>      <int>
##  1      517            515         2          2
##  2      533            529         4          4
##  3      542            540         2          2
##  4      544            545        -1         -1
##  5      554            600        -6        -46
##  6      554            558        -4         -4
##  7      555            600        -5        -45
##  8      557            600        -3        -43
##  9      557            600        -3        -43
## 10      558            600        -2        -42
## # ... with 336,766 more rows

The calculation matches dep_delay only when both times fall within the same hour. When they don’t (e.g. row 5: 554 - 600 = -46, but the delay is -6 minutes), the HHMM format breaks the subtraction. Converting both columns to minutes since midnight would make the relationship hold.
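Converting both columns to minutes since midnight before subtracting gives the expected difference (a sketch; a %% 1440 wrap would also be needed for flights scheduled just before midnight that depart just after):

```r
mutate(select(flights, dep_time, sched_dep_time, dep_delay),
       dep_min = dep_time %/% 100 * 60 + dep_time %% 100,
       sched_min = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100,
       difference = dep_min - sched_min)
```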

  4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
# add column for min rank
# filter by min rank column
# sort by the rank column
flights %>%
  mutate(dep_rank = min_rank(dep_delay)) %>%
  filter(dep_rank <= 10) %>%
  arrange(dep_rank)
## # A tibble: 12 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## 11  2013     5    14      914            938       -24     1143           1204
## 12  2013     9    18     1631           1655       -24     1812           1845
## # ... with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_rank <int>

min_rank() does not break ties; instead it assigns the same rank to equal values. We could also use row_number(), which breaks ties by giving the lower rank to the value encountered first. We can try row_number() and see how the answer changes.

# add column for row_number()
# filter by row_number column
# sort by the row_number column
flights %>% 
  mutate(dep_rank = row_number(dep_delay)) %>%
  filter(dep_rank <= 10) %>%
  arrange(dep_rank)
## # A tibble: 10 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # ... with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_rank <int>

Now we have exactly 10 flights since we broke ties with the row_number() function, whereas we had 12 flights using the min_rank() function. Depending on our needs, we might prefer one method over the other.

  5. What does 1:3 + 1:10 return? Why?
# 1:3 + 1:10

Running this produces a warning (not an error): “longer object length is not a multiple of shorter object length”. R recycles the shorter vector to match the length of the longer one, and warns when the longer length is not an exact multiple of the shorter. The result is still returned: 1:3 is recycled as c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1), giving c(2, 4, 6, 5, 7, 9, 8, 10, 12, 11). We can silence the warning by making the longer vector’s length a multiple of the shorter vector’s length.

1:3 + 1:12
##  [1]  2  4  6  5  7  9  8 10 12 11 13 15

Now the vector 1:3 is recycled four times to match the length of 1:12 and added element-wise, with no warning. Recycling is also what happens when you add a scalar to a vector: the scalar is reused for every element.
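The scalar case written out as recycling:

```r
# the scalar 10 is recycled to the length of 1:5
1:5 + 10
## [1] 11 12 13 14 15
```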

  6. What trigonometric functions does R provide?

You can access the trigonometric functions documentation by typing ?Trig. R provides the following trig functions:

  • cos(x)
  • sin(x)
  • tan(x)
  • acos(x)
  • asin(x)
  • atan(x)
  • atan2(y,x)
  • cospi(x)
  • sinpi(x)
  • tanpi(x)

5.6 Grouped summaries with summarise()

The last key verb is summarise(). It collapses a dataframe to a single row.

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

summarise() is more useful when paired with group_by(). This changes the unit of analysis from the entire dataframe to individual groups. We can apply the same function to the grouped dataframe to get the average delay per date.

group_by(flights, year, month, day) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

5.6.1 Combining multiple operations with the pipe

Not going to go through this. See R4ds book online.

5.6.2 Missing values

The na.rm argument removes NA values before the computation. Alternatively, we can filter the NA values out of the dataframe before summarizing.

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.4 
##  2  2013     1     2 13.7 
##  3  2013     1     3 10.9 
##  4  2013     1     4  8.97
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.56
##  9  2013     1     9  2.30
## 10  2013     1    10  2.84
## # ... with 355 more rows

5.6.3 Counts

Whenever you do any aggregation, it’s always a good idea to include either a count (n()) or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes that have the highest average delays.

delays <- not_cancelled %>% 
  group_by(tailnum) %>%
  summarize(
    delay = mean(arr_delay)
  )

ggplot(data = delays, mapping = aes(x = delay)) +
  geom_freqpoly(binwidth = 10)

We get more information if we draw a scatterplot of number of flights vs. average delay.

delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )


ggplot(data = delays, mapping = aes(x=n, y = delay)) +
  geom_point(alpha = 1/10)

Now we see that there is much greater variation in the average delay when there are few flights. Typically, when you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases.

When looking at this type of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups. This is what the following code does, as well as showing you a handy pattern for integrating ggplot2 into dplyr flows.

delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 1/10)

There is another common variation of this type of pattern. Let’s look at how the average performance of batters in baseball is related to the number of times they’re at bat.

library(Lahman)
## Warning: package 'Lahman' was built under R version 4.0.5
# convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
  group_by(playerID) %>%
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE), # batting average
    ab = sum(AB, na.rm = TRUE) # at bat
  )

batters %>%
  filter(ab > 100) %>%
  ggplot(mapping = aes(x=ab, y=ba)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Note:

  1. The variation in our aggregate decreases as we get more data points.
  2. There’s a positive correlation between skill (ba) and opportunities to hit the ball (ab). This is because teams control who gets to play, and obviously pick their best players.

5.6.4 Useful summary functions

It is sometimes useful to combine aggregation with logical subsetting.

not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # ... with 355 more rows

Measures of spread are also useful.

  • standard deviation: sd()
  • interquartile range: IQR()
  • median absolute deviation: mad()
# why is the distance to some destinations more variable than others?
not_cancelled %>%
  group_by(dest) %>%
  summarise(distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))
## # A tibble: 104 x 2
##    dest  distance_sd
##    <chr>       <dbl>
##  1 EGE         10.5 
##  2 SAN         10.4 
##  3 SFO         10.2 
##  4 HNL         10.0 
##  5 SEA          9.98
##  6 LAS          9.91
##  7 PDX          9.87
##  8 PHX          9.86
##  9 LAX          9.66
## 10 IND          9.46
## # ... with 94 more rows

Measures of rank:

  • minimum: min()
  • quantiles: quantile()
  • maximum: max()
# when do the first and last flights leave each day?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # ... with 355 more rows

Measures of position:

  • choose first: first()
  • choose nth: nth()
  • choose last: last()
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first_dep = first(dep_time),
    last = last(dep_time)
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first_dep  last
##    <int> <int> <int>     <int> <int>
##  1  2013     1     1       517  2356
##  2  2013     1     2        42  2354
##  3  2013     1     3        32  2349
##  4  2013     1     4        25  2358
##  5  2013     1     5        14  2357
##  6  2013     1     6        16  2355
##  7  2013     1     7        49  2359
##  8  2013     1     8       454  2351
##  9  2013     1     9         2  2252
## 10  2013     1    10         3  2320
## # ... with 355 more rows

These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row.

not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r=min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))
## # A tibble: 770 x 20
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1     2356           2359        -3      425            437
##  3  2013     1     2       42           2359        43      518            442
##  4  2013     1     2     2354           2359        -5      413            437
##  5  2013     1     3       32           2359        33      504            442
##  6  2013     1     3     2349           2359       -10      434            445
##  7  2013     1     4       25           2359        26      505            442
##  8  2013     1     4     2358           2359        -1      429            437
##  9  2013     1     4     2358           2359        -1      436            445
## 10  2013     1     5       14           2359        15      503            445
## # ... with 760 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   r <int>

Counts:

  • count non-missing values: sum(!is.na(x))
  • count distinct (unique) values: n_distinct()
not_cancelled %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))
## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows
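The sum(!is.na(x)) idiom from the list above counts non-missing values. A sketch using the full flights data (where arr_time can be NA):

```r
# number of flights with a recorded arrival time, per destination
flights %>%
  group_by(dest) %>%
  summarise(n_arr = sum(!is.na(arr_time)))
```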

dplyr provides a simple helper if all you want is a count.

not_cancelled %>%
  count(dest)
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows

You could also use a weight variable. For example, you could use this to sum the total number of miles a plane flew.

not_cancelled %>%
  count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows

Counts and proportions of logical values:

# how many flights left before 5am?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(n_early = sum(dep_time < 500))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows
# what proportion of flights are delayed by more than an hour?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(hour_prop = mean(arr_delay > 60))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day hour_prop
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # ... with 355 more rows

5.6.5 Grouping by multiple variables

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset.

daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights=n()))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # ... with 355 more rows
(per_month <- summarise(per_day, flights=sum(flights)))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135
(per_year <- summarise(per_month, flights = sum(flights)))
## # A tibble: 1 x 2
##    year flights
##   <int>   <int>
## 1  2013  336776

5.6.6 Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup()

daily %>%
  ungroup() %>%
  summarise(flights = n())
## # A tibble: 1 x 1
##   flights
##     <int>
## 1  336776

5.6.7 Exercises

  1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
  • a flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
not_cancelled %>%
  group_by(flight) %>%
  summarise(
    early_prop = mean(arr_delay < 0),
    late_prop = mean(arr_delay > 0)
  ) %>%
  filter(
    early_prop == 0.5,
    late_prop == 0.5
  )
## # A tibble: 147 x 3
##    flight early_prop late_prop
##     <int>      <dbl>     <dbl>
##  1     64        0.5       0.5
##  2     69        0.5       0.5
##  3    100        0.5       0.5
##  4    107        0.5       0.5
##  5    110        0.5       0.5
##  6    113        0.5       0.5
##  7    151        0.5       0.5
##  8    154        0.5       0.5
##  9    194        0.5       0.5
## 10    278        0.5       0.5
## # ... with 137 more rows
  • a flight is always 10 minutes late
# note: this checks "always late" (arr_delay > 0); to require exactly
# 10 minutes late, use mean(arr_delay == 10) instead
not_cancelled %>%
  group_by(flight) %>%
  summarise(
    late_prop = mean(arr_delay > 0)
  ) %>%
  filter(
    late_prop == 1
  )
## # A tibble: 147 x 2
##    flight late_prop
##     <int>     <dbl>
##  1     94         1
##  2    730         1
##  3    896         1
##  4    919         1
##  5    955         1
##  6    974         1
##  7    990         1
##  8   1015         1
##  9   1084         1
## 10   1086         1
## # ... with 137 more rows
  • a flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time
not_cancelled %>%
  group_by(flight) %>%
  summarise(
    early_30 = mean(arr_delay == -30),
    late_30 = mean(arr_delay == 30)
  ) %>%
  filter(
    early_30 == 0.5,
    late_30 == 0.5
  )
## # A tibble: 0 x 3
## # ... with 3 variables: flight <int>, early_30 <dbl>, late_30 <dbl>
  • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
not_cancelled %>%
  group_by(flight) %>%
  summarise(
    early_prop = mean(arr_delay == 0),
    late_prop = mean(arr_delay == 2 * 60)
  ) %>%
  filter(
    early_prop == 0.99,
    late_prop == 0.01
  )
## # A tibble: 0 x 3
## # ... with 3 variables: flight <int>, early_prop <dbl>, late_prop <dbl>

Which is more important: arrival delay or departure delay?

Arrival delay is arguably more important to passengers, and we used arr_delay to compute all of the measures above.

  2. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count())
# match the first command using n() with summarise()
counts <- not_cancelled %>% count(dest)

summarise_counts <- not_cancelled %>%
  group_by(dest) %>%
  summarise(summarise_n = n())

full_join(counts, summarise_counts)
## Joining, by = "dest"
## # A tibble: 104 x 3
##    dest      n summarise_n
##    <chr> <int>       <int>
##  1 ABQ     254         254
##  2 ACK     264         264
##  3 ALB     418         418
##  4 ANC       8           8
##  5 ATL   16837       16837
##  6 AUS    2411        2411
##  7 AVL     261         261
##  8 BDL     412         412
##  9 BGR     358         358
## 10 BHM     269         269
## # ... with 94 more rows
# match the second command using sum() and summarise()
sums <- not_cancelled %>% count(tailnum, wt=distance)

summarise_sums <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(summarise_n = sum(distance))

full_join(sums, summarise_sums)
## Joining, by = "tailnum"
## # A tibble: 4,037 x 3
##    tailnum      n summarise_n
##    <chr>    <dbl>       <dbl>
##  1 D942DN    3418        3418
##  2 N0EGMQ  239143      239143
##  3 N10156  109664      109664
##  4 N102UW   25722       25722
##  5 N103US   24619       24619
##  6 N104UW   24616       24616
##  7 N10575  139903      139903
##  8 N105UW   23618       23618
##  9 N107US   21677       21677
## 10 N108UW   32070       32070
## # ... with 4,027 more rows
  3. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

arr_delay is the most important column because there seem to be flights with dep_delay entries even though arr_delay is NA.

# arr_delay is NA but dep_delay is not NA
filter(flights, 
       is.na(arr_delay),
       !is.na(dep_delay))
## # A tibble: 1,175 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1     1525           1530        -5     1934           1805
##  2  2013     1     1     1528           1459        29     2002           1647
##  3  2013     1     1     1740           1745        -5     2158           2020
##  4  2013     1     1     1807           1738        29     2251           2103
##  5  2013     1     1     1939           1840        59       29           2151
##  6  2013     1     1     1952           1930        22     2358           2207
##  7  2013     1     1     2016           1930        46       NA           2220
##  8  2013     1     2      905            822        43     1313           1045
##  9  2013     1     2     1125            925       120     1445           1146
## 10  2013     1     2     1848           1840         8     2333           2151
## # ... with 1,165 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# dep_delay is NA but arr_delay is not NA - no flights here
filter(flights,
       is.na(dep_delay),
       !is.na(arr_delay))
## # A tibble: 0 x 19
## # ... with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
## #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
  4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
flights %>%
  group_by(year, month, day) %>%
  summarise(
    prop_cancelled = mean(is.na(arr_delay)), # proportion cancelled
    avg_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  ggplot(aes(x=prop_cancelled, y=avg_delay)) +
  geom_point()
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

There seems to be a slight positive relationship between avg_delay and prop_cancelled, though there are some outlier days.

  5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

We have to decide whether departure or arrival delays are more important. Let’s check the relationship between the two.

ggplot(flights) +
  geom_count(aes(x=arr_delay,
                 y=dep_delay,
                size=..n..,
                color=..n..)) +
  guides(color = 'legend') +
  geom_abline(slope=1, color='red')
## Warning: Removed 9430 rows containing non-finite values (stat_sum).

The two types of delays are strongly related, though there appear to be more points under the red line, indicating that arrival delays tend to be shorter than departure delays. This makes sense because time can be made up during a smooth flight. It’s probably more important to people that they arrive at their destination on time, so we’ll go with arrival delay.

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  # note: this takes the maximum departure delay; max(arr_delay) would
  # match the arrival-delay choice made above
  summarise(max_delay = max(dep_delay)) %>%
  arrange(desc(max_delay))
## # A tibble: 16 x 2
##    carrier max_delay
##    <chr>       <dbl>
##  1 HA           1301
##  2 MQ           1137
##  3 AA           1014
##  4 DL            960
##  5 F9            853
##  6 9E            747
##  7 VX            653
##  8 FL            602
##  9 EV            548
## 10 B6            502
## 11 US            500
## 12 UA            483
## 13 WN            471
## 14 YV            387
## 15 AS            225
## 16 OO            154

We remove the flights with NA values for arrival delay since we can’t do anything with those. We see that carriers with the worst delays are HA, MQ, and AA.

Now we’ll try disentangling the effects of bad airports vs. bad carriers. We’ll group by the carrier and origin airport to find the maximum delay for each pair, then pass the results to ggplot so we can easily visualize them as a heatmap. HA and MQ still stand out among carriers, and JFK seems to have some long delays. We could follow the same procedure for destination airports as well.

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier, origin) %>%
  summarise(max_delay = max(dep_delay)) %>%
  arrange(desc(max_delay)) %>%
  ggplot() +
  geom_tile(aes(x=origin,
                y=carrier,
                fill=max_delay))
## `summarise()` has grouped output by 'carrier'. You can override using the `.groups` argument.

  6. What does the sort argument to count() do? When might you use it?

The count() function counts the unique values of one or more variables, and the sort argument sorts the result so the largest groups appear at the top. It’s handy when you would otherwise follow count() with arrange(desc(n)).

flights %>%
  count(carrier, sort=TRUE)
## # A tibble: 16 x 2
##    carrier     n
##    <chr>   <int>
##  1 UA      58665
##  2 B6      54635
##  3 EV      54173
##  4 DL      48110
##  5 AA      32729
##  6 MQ      26397
##  7 US      20536
##  8 9E      18460
##  9 WN      12275
## 10 VX       5162
## 11 FL       3260
## 12 AS        714
## 13 F9        685
## 14 YV        601
## 15 HA        342
## 16 OO         32

5.7 Grouped mutates (and filters)

Grouping is most useful in conjunction with summarise(), but you can also do convenient operations with mutate() and filter().

Find the worst members of each group:

flights %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 19
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      848           1835       853     1001           1950
##  2  2013     1     1     1815           1325       290     2120           1542
##  3  2013     1     1     1842           1422       260     1958           1535
##  4  2013     1     1     1942           1705       157     2124           1830
##  5  2013     1     1     2006           1630       216     2230           1848
##  6  2013     1     1     2115           1700       255     2330           1920
##  7  2013     1     1     2205           1720       285       46           2040
##  8  2013     1     1     2312           2000       192       21           2110
##  9  2013     1     1     2343           1724       379      314           1938
## 10  2013     1     2     1244            900       224     1431           1104
## # ... with 3,296 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Find all groups bigger than a threshold

popular_dests <- flights %>%
  group_by(dest) %>%
  filter(n() > 365)
popular_dests
## # A tibble: 332,577 x 19
## # Groups:   dest [77]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 332,567 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Standardise to compute per-group metrics:

popular_dests %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##     year month   day dest  arr_delay prop_delay
##    <int> <int> <int> <chr>     <dbl>      <dbl>
##  1  2013     1     1 IAH          11  0.000111 
##  2  2013     1     1 IAH          20  0.000201 
##  3  2013     1     1 MIA          33  0.000235 
##  4  2013     1     1 ORD          12  0.0000424
##  5  2013     1     1 FLL          19  0.0000938
##  6  2013     1     1 ORD           8  0.0000283
##  7  2013     1     1 LAX           7  0.0000344
##  8  2013     1     1 DFW          31  0.000282 
##  9  2013     1     1 ATL          12  0.0000400
## 10  2013     1     1 DTW          16  0.000116 
## # ... with 131,096 more rows

5.7.1 Exercises

  1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

  2. Which plane (tailnum) has the worst on-time record?

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  mutate(max_delay=max(arr_delay)) %>%
  arrange(desc(max_delay)) %>%
  select(tailnum, max_delay) %>%
  unique()
## # A tibble: 4,037 x 2
## # Groups:   tailnum [4,037]
##    tailnum max_delay
##    <chr>       <dbl>
##  1 N384HA       1272
##  2 N504MQ       1127
##  3 N517MQ       1109
##  4 N338AA       1007
##  5 N665MQ        989
##  6 N959DL        931
##  7 N927DA        915
##  8 N6716C        895
##  9 N5DMAA        878
## 10 N523MQ        875
## # ... with 4,027 more rows

The worst on-time record is held by plane N384HA, with a maximum arrival delay of 1272 minutes.

  3. What time of day should you fly if you want to avoid delays as much as possible?
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(sched_dep_time) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay))
## # A tibble: 1,020 x 2
##    sched_dep_time avg_delay
##             <int>     <dbl>
##  1           2207     105. 
##  2           1848      69  
##  3           1752      68.2
##  4           1531      65.1
##  5           2339      59  
##  6           1739      56.1
##  7           1653      55.4
##  8           1747      54.6
##  9            558      54.3
## 10           2103      52.4
## # ... with 1,010 more rows

The worst average delay occurs with flights that are scheduled to depart at 2207.
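Grouping by the exact scheduled departure time produces over a thousand groups, so as a coarser sketch we could group by the hour column instead, which makes the time-of-day pattern easier to read (smallest average delays first):

```r
# average arrival delay by scheduled departure hour
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(hour) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(avg_delay)
```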

  4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
# note: this groups by tailnum (per plane); the exercise as stated
# would use group_by(dest) instead
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(tailnum, arr_delay, prop_delay)
## # A tibble: 327,346 x 3
## # Groups:   tailnum [4,037]
##    tailnum arr_delay prop_delay
##    <chr>       <dbl>      <dbl>
##  1 N14228         11    0.0267 
##  2 N24211         20    0.0200 
##  3 N619AA         33    0.188  
##  4 N804JB        -18    0.045  
##  5 N668DN        -25   -0.198  
##  6 N39463         12    0.0519 
##  7 N516JB         19    0.00556
##  8 N829AS        -14   -0.00379
##  9 N593JB         -8   -0.00217
## 10 N3ALAA          8    0.0354 
## # ... with 327,336 more rows
  5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been removed, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(origin) %>%
  arrange(sched_dep_time) %>%
  mutate(prev_delay = lag(arr_delay)) %>%
  select(flight, arr_delay, prev_delay) %>%
  ggplot() +
  geom_point(aes(x=prev_delay,
                 y=arr_delay))
## Adding missing grouping variables: `origin`
## Warning: Removed 3 rows containing missing values (geom_point).

The y-axis shows the arrival delay for a particular flight, and the x-axis shows the arrival delay for the flight that was scheduled to depart just before it from the same origin airport. The plot doesn’t show a strong relationship between the arrival delay of a flight and that of the flight departing immediately before it.

  6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
flight_speed <- flights %>%
  select(flight, tailnum, origin, dest, air_time, distance) %>%
  mutate(speed =  distance / (air_time/60)) %>%
  arrange(desc(speed))

flight_speed
## # A tibble: 336,776 x 7
##    flight tailnum origin dest  air_time distance speed
##     <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>
##  1   1499 N666DN  LGA    ATL         65      762  703.
##  2   4667 N17196  EWR    MSP         93     1008  650.
##  3   4292 N14568  EWR    GSP         55      594  648 
##  4   3805 N12567  EWR    BNA         70      748  641.
##  5   1902 N956DL  LGA    PBI        105     1035  591.
##  6    315 N3768   JFK    SJU        170     1598  564 
##  7    707 N779JB  JFK    SJU        172     1598  557.
##  8    936 N5FFAA  JFK    STT        175     1623  556.
##  9    347 N3773D  JFK    SJU        173     1598  554.
## 10   1503 N571JB  JFK    SJU        173     1598  554.
## # ... with 336,766 more rows

We compute the speed of each flight by taking the distance traveled (distance) divided by the time spent in the air (air_time). The air time is in minutes, so we divide it by 60 to get speed in miles per hour. We can look at the distribution of flight speeds to find outliers.

summary(flight_speed$speed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    76.8   358.1   404.2   394.3   438.8   703.4    9430

It looks like the median flight speed is about 404 miles per hour, and 75% of flights fly under about 439 miles per hour. This suggests that our fastest flights, at about 600-700 miles per hour, are quite fast compared to the average flight.

We can now look at each flight’s air time in a destination-specific manner.

flights %>%
  group_by(origin, dest) %>%
  filter(!is.na(air_time)) %>%
  mutate(time_diff = air_time - min(air_time)) %>%
  arrange(desc(time_diff)) %>%
  select(carrier:air_time, time_diff) %>%
  group_by(origin, dest) %>%
  slice(1) %>%
  arrange(desc(time_diff))
## # A tibble: 223 x 7
## # Groups:   origin, dest [223]
##    carrier flight tailnum origin dest  air_time time_diff
##    <chr>    <int> <chr>   <chr>  <chr>    <dbl>     <dbl>
##  1 DL         841 N703TW  JFK    SFO        490       189
##  2 DL         426 N178DN  JFK    LAX        440       165
##  3 AA         575 N5DBAA  JFK    EGE        382       163
##  4 UA         745 N578UA  LGA    DEN        331       145
##  5 UA         587 N852UA  EWR    LAS        399       143
##  6 UA          15 N77066  EWR    HNL        695       133
##  7 B6          89 N794JB  JFK    SAN        413       132
##  8 UA        1075 N16709  EWR    SNA        405       131
##  9 UA        1178 N18243  EWR    AUS        301       127
## 10 B6        1295 N593JB  JFK    AUS        301       126
## # ... with 213 more rows

We group by the origin and destination so we can compare similar flights. Then compute the difference between air time and the minimum air time for flights on the same route. Then we can sort and take the flight with the largest time difference. Here we see that the largest time difference occurs for flight 841 from JFK to SFO.

  7. For each plane, count the number of flights before the first delay of greater than 1 hour.
delayed <- flights %>%
  group_by(tailnum, year, month, day) %>%
  arrange(dep_time) %>%
  summarise(num_prior_delay = sum(arr_delay < 60),
          num_delayed = sum(arr_delay > 60),
          prop_prior_delayed = num_prior_delay / (num_delayed + num_prior_delay)) %>%
  select(tailnum, year, month, day, num_prior_delay, num_delayed, prop_prior_delayed)
## `summarise()` has grouped output by 'tailnum', 'year', 'month'. You can override using the `.groups` argument.
delayed
## # A tibble: 251,727 x 7
## # Groups:   tailnum, year, month [37,988]
##    tailnum  year month   day num_prior_delay num_delayed prop_prior_delayed
##    <chr>   <int> <int> <int>           <int>       <int>              <dbl>
##  1 D942DN   2013     2    11               0           1                0  
##  2 D942DN   2013     3    23               1           0                1  
##  3 D942DN   2013     3    24               1           0                1  
##  4 D942DN   2013     7     5               1           0                1  
##  5 N0EGMQ   2013     1     1               1           1                0.5
##  6 N0EGMQ   2013     1     2               2           0                1  
##  7 N0EGMQ   2013     1     4               1           0                1  
##  8 N0EGMQ   2013     1     5               1           0                1  
##  9 N0EGMQ   2013     1     6               2           0                1  
## 10 N0EGMQ   2013     1     7               3           0                1  
## # ... with 251,717 more rows

These columns show the number of flights that were delayed, the number of flights prior to the delay, and the proportion of flights that occurred prior to the one-hour delay. This code assumes that once a flight is delayed by 1 hour on a given day, it never returns to schedule. We can look at a summary of these numbers.

summary(delayed)
##    tailnum               year          month             day       
##  Length:251727      Min.   :2013   Min.   : 1.000   Min.   : 1.00  
##  Class :character   1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00  
##  Mode  :character   Median :2013   Median : 7.000   Median :16.00  
##                     Mean   :2013   Mean   : 6.565   Mean   :15.73  
##                     3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00  
##                     Max.   :2013   Max.   :12.000   Max.   :31.00  
##                                                                    
##  num_prior_delay  num_delayed    prop_prior_delayed
##  Min.   :0.000   Min.   :0.000   Min.   :0.000     
##  1st Qu.:1.000   1st Qu.:0.000   1st Qu.:1.000     
##  Median :1.000   Median :0.000   Median :1.000     
##  Mean   :1.205   Mean   :0.108   Mean   :0.919     
##  3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:1.000     
##  Max.   :5.000   Max.   :4.000   Max.   :1.000     
##  NA's   :6661    NA's   :6661    NA's   :6933

It looks like there is typically 1 flight before a plane is first delayed by at least one hour, with a maximum of 5 flights prior to the first 1-hour delay. However, most flights are not delayed by more than one hour on a given day.

Chapter 6: Workflow: Scripts

Not much going on in this chapter. Just a basic overview of writing scripts and using RStudio. Here’s the link: https://r4ds.had.co.nz/workflow-scripts.html

Chapter 7: Exploratory Data Analysis

7.2 Questions

Two important questions to guide your data exploration:

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

7.3 Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. The best way to understand a variable’s pattern of variation is to visualise the distribution of its values.

7.3.1 Visualising distributions

How you visualise the distribution of a variable depends on whether the variable is categorical or continuous. To examine the distribution of a categorical variable, use a bar chart:

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

The height of the bars displays how many observations occurred with each x value. You can also compute this number manually using dplyr::count()

diamonds %>%
  count(cut)
## # A tibble: 5 x 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551

To examine the distribution of a continuous variable, use a histogram:

ggplot(data=diamonds) +
  geom_histogram(mapping = aes(x=carat), binwidth=0.5)

You can compute this by hand by combining dplyr::count() and ggplot2::cut_width().

diamonds %>%
  count(cut_width(carat, 0.5))
## # A tibble: 11 x 2
##    `cut_width(carat, 0.5)`     n
##    <fct>                   <int>
##  1 [-0.25,0.25]              785
##  2 (0.25,0.75]             29498
##  3 (0.75,1.25]             15977
##  4 (1.25,1.75]              5313
##  5 (1.75,2.25]              2002
##  6 (2.25,2.75]               322
##  7 (2.75,3.25]                32
##  8 (3.25,3.75]                 5
##  9 (3.75,4.25]                 4
## 10 (4.25,4.75]                 1
## 11 (4.75,5.25]                 1

You should always play with different binwidths when making histograms, as different binwidths can reveal different patterns.

# take only the diamonds with less than 3 carats
smaller <- diamonds %>%
  filter(carat < 3)

ggplot(data=smaller, mapping=aes(x=carat)) +
  geom_histogram(binwidth = 0.1)

If you wish to overlay multiple histograms in the same plot, use geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than bars.

ggplot(data = smaller, mapping = aes(x=carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

7.3.2 Typical values

Asking questions about the frequency of values is useful in data exploration. You can look for anything unexpected with questions like:

  • which values are the most common? why?
  • which values are rare? why? does that match your expectations?
  • can you see any unusual patterns? what might explain them?

As an example, the histogram below suggests several interesting questions:

  • why are there more diamonds at whole carats and common fractions of carats?
  • why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
  • why are there no diamonds bigger than 3 carats?
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:

  • how are the observations within each cluster similar to each other?
  • how are the observations in separate clusters different from each other?
  • how can you explain or describe the clusters?
  • why might the appearance of clusters be misleading?

The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.

ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)

Many of the questions above will prompt you to explore a relationship between variables.

7.3.3 Unusual values

Sometimes outliers are difficult to spot on histograms. Take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

There are so many observations in the common bins that the rare bins are too short to see. To make them visible, we can zoom in on small values of the y-axis using coord_cartesian().

ggplot(diamonds) +
  geom_histogram(mapping = aes(x=y), binwidth=0.5) +
  coord_cartesian(ylim = c(0, 50))

coord_cartesian() also has an xlim argument for zooming in on the x-axis. ggplot2 also has xlim() and ylim() functions, which work slightly differently: they throw away the data outside the limits.

ggplot(diamonds) +
  geom_histogram(mapping = aes(x=y), 
                 binwidth=0.5) +
  ylim(0, 50)
## Warning: Removed 11 rows containing missing values (geom_bar).

You can see from the warning that the data outside the limits have been removed when using ggplot2’s ylim().

There appear to be three unusual values: 0, ~30, and ~60. We can pluck them out with dplyr:

unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
## # A tibble: 9 x 4
##   price     x     y     z
##   <int> <dbl> <dbl> <dbl>
## 1  5139  0      0    0   
## 2  6381  0      0    0   
## 3 12800  0      0    0   
## 4 15686  0      0    0   
## 5 18034  0      0    0   
## 6  2130  0      0    0   
## 7  2130  0      0    0   
## 8  2075  5.15  31.8  5.12
## 9 12210  8.09  58.9  8.06

The y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. We might also suspect that the measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!

It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
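As a sketch of that workflow (the y < 3 | y > 20 cutoffs mirror the filter used above; the choice of summary statistics here is just an illustration):

```r
library(dplyr)
library(ggplot2)  # for the diamonds dataset

# summary statistics with the outliers included
with_outliers <- diamonds %>%
  summarise(mean_y = mean(y), sd_y = sd(y))

# the same statistics after dropping the implausible y values
without_outliers <- diamonds %>%
  filter(y >= 3, y <= 20) %>%
  summarise(mean_y = mean(y), sd_y = sd(y))

with_outliers
without_outliers
```

If the two summaries barely differ, replacing the outliers with missing values and moving on is defensible; a large difference calls for investigating the cause first.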

7.3.4 Exercises

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
diamonds %>%
  gather(key='dimension', value='value', x:z) %>%
  ggplot() +
  geom_histogram(aes(x=value,
                 fill=dimension),
                 binwidth = 0.5) +
  facet_wrap(. ~ dimension, ncol=1) +
  coord_cartesian(xlim=c(1,10))

We see that the z dimension tends to be smaller than the x or y dimensions. The x and y dimensions are similarly distributed. We typically see diamonds cut with circular profiles and varying height. We might therefore guess that the width and length are along the x and y axes, while the height is along the z-axis.

  2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(diamonds) +
  geom_histogram(aes(x=price), binwidth = 100) +
  coord_cartesian(xlim=c(0,2000))

There is an interesting gap in the distribution right at 1500 that is visible with binwidth 100.

  3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
ggplot(diamonds) +
  geom_histogram(aes(x=carat), binwidth=0.01) +
  coord_cartesian(xlim=c(0.9, 1.1))

There are far more diamonds at exactly 1 carat than at 0.99 carat. Sizes just below the threshold are likely rounded up, since a 1 carat diamond is much more marketable than a 0.99 carat one.

  4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try to zoom so only half a bar shows?

coord_cartesian() only changes the viewport, so data outside the window still exist and contribute to the plot, whereas xlim() and ylim() throw away observations outside the limits before the statistics are computed. If you leave binwidth unset, geom_histogram() falls back to a default of 30 bins, computed over whatever data remain. If you zoom so only half a bar shows, coord_cartesian() displays the clipped half of the bar, while xlim()/ylim() drop the underlying observations, so the bar disappears or is recomputed from the remaining data.

7.4 Missing values

If you’ve found unusual values in your dataset, you might want to replace them with missing (NA) values. You can use the ifelse() function to do this.

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ifelse() takes three arguments: the first, test, should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is FALSE.

Alternatively, you can use dplyr::case_when(), which is particularly useful inside mutate() when you want to create a new variable that relies on a complex combination of existing variables.
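For example, the same replacement as above could be written with case_when() (a sketch; the thresholds mirror the ifelse() call):

```r
library(dplyr)
library(ggplot2)  # for the diamonds dataset

diamonds2 <- diamonds %>%
  mutate(y = case_when(
    y < 3  ~ NA_real_,  # implausibly small widths
    y > 20 ~ NA_real_,  # implausibly large widths
    TRUE   ~ y          # keep everything else
  ))
```

Each left-hand side is a logical vector; the first condition that is TRUE for a row supplies that row’s value.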

Other times, you might want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights.

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) +
  geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

7.4.1 Exercises

  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
sum(is.na(diamonds2$y))
## [1] 9
ggplot(diamonds2) +
  geom_histogram(aes(x=y), fill='chocolate1') +
  geom_bar(aes(x=y), fill='cornflowerblue')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 9 rows containing non-finite values (stat_bin).
## Warning: Removed 9 rows containing non-finite values (stat_count).

In both plots above, the missing values are removed with a warning, because y is continuous. In a bar chart of a categorical variable, by contrast, NA is treated as just another category and gets its own bar.

  2. What does na.rm = TRUE do in mean() and sum()?
mean(diamonds2$y)
## [1] NA
mean(diamonds2$y, na.rm = TRUE)
## [1] 5.733801
sum(diamonds2$y)
## [1] NA
sum(diamonds2$y, na.rm = TRUE)
## [1] 309229.6

With both mean and sum, if there are missing values, the output will be NA. When we use na.rm = TRUE, the missing values are removed and we can get a numeric answer returned.

7.5 Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that should again depend on the type of variables involved.

7.5.1 A categorical and continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how the price of a diamond varies with its quality:

ggplot(data=diamonds, mapping=aes(x=price)) +
  geom_freqpoly(mapping=aes(colour=cut), binwidth=500)

It’s hard to see the difference in distributions because the overall counts differ so much:

ggplot(diamonds) +
  geom_bar(mapping = aes(x=cut))

To make the comparison easier, we can swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.

ggplot(diamonds, mapping=aes(x=price, y=..density..)) +
  geom_freqpoly(mapping=aes(colour=cut), binwidth=500)

Another way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians.

Let’s take a look at the distribution of price by cut using geom_boxplot():

ggplot(diamonds, aes(x=cut, y=price)) +
  geom_boxplot()

cut is an ordered factor: fair is worse than good, which is worse than very good, and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the reorder() function.

For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:

ggplot(mpg, mapping=aes(x=class, y=hwy)) +
  geom_boxplot()

To make the trend easier to see, we can reorder class based on the median value of hwy:

ggplot(data=mpg) +
  geom_boxplot(aes(x=reorder(class, hwy, FUN=median), y=hwy))

If you have long variable names, geom_boxplot() will work better for you if you flip it 90 degrees. You can do that with coord_flip():

ggplot(mpg) +
  geom_boxplot(aes(x=reorder(class, hwy, FUN=median), y=hwy)) +
  coord_flip()

7.5.1.1 Exercises

  1. Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.
nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(x=cancelled, y=sched_dep_time)) + 
    geom_boxplot(aes(color=cancelled))

  2. What variable in the diamonds dataset is the most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
# separate categorical features
new_diamonds <- diamonds %>%
  gather(key=feature, value=value, -price) %>%
  mutate(feature_type = ifelse(feature %in% c('cut', 'color', 'clarity'),
                'cat', 'cont'))
## Warning: attributes are not identical across measure variables;
## they will be dropped
cats <- new_diamonds %>%
  filter(feature_type == 'cat')
cont <- new_diamonds %>%
  filter(feature_type == 'cont')
  

# plot the continuous features against price to visualize relationships  
ggplot(cont) +
  geom_point(aes(x=value, y=price)) +
  facet_wrap(. ~ feature, scales='free_x')

# plot categorical features
ggplot(cats) +
  geom_boxplot(aes(x=value, y=price)) +
  facet_wrap(. ~ feature, scales='free')

summary(lm(price ~ ., data=diamonds))
## 
## Call:
## lm(formula = price ~ ., data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21376.0   -592.4   -183.5    376.4  10694.2 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  5753.762    396.630   14.507  < 2e-16 ***
## carat       11256.978     48.628  231.494  < 2e-16 ***
## cut.L         584.457     22.478   26.001  < 2e-16 ***
## cut.Q        -301.908     17.994  -16.778  < 2e-16 ***
## cut.C         148.035     15.483    9.561  < 2e-16 ***
## cut^4         -20.794     12.377   -1.680  0.09294 .  
## color.L     -1952.160     17.342 -112.570  < 2e-16 ***
## color.Q      -672.054     15.777  -42.597  < 2e-16 ***
## color.C      -165.283     14.725  -11.225  < 2e-16 ***
## color^4        38.195     13.527    2.824  0.00475 ** 
## color^5       -95.793     12.776   -7.498 6.59e-14 ***
## color^6       -48.466     11.614   -4.173 3.01e-05 ***
## clarity.L    4097.431     30.259  135.414  < 2e-16 ***
## clarity.Q   -1925.004     28.227  -68.197  < 2e-16 ***
## clarity.C     982.205     24.152   40.668  < 2e-16 ***
## clarity^4    -364.918     19.285  -18.922  < 2e-16 ***
## clarity^5     233.563     15.752   14.828  < 2e-16 ***
## clarity^6       6.883     13.715    0.502  0.61575    
## clarity^7      90.640     12.103    7.489 7.06e-14 ***
## depth         -63.806      4.535  -14.071  < 2e-16 ***
## table         -26.474      2.912   -9.092  < 2e-16 ***
## x           -1008.261     32.898  -30.648  < 2e-16 ***
## y               9.609     19.333    0.497  0.61918    
## z             -50.119     33.486   -1.497  0.13448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9198 
## F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16

Based on the plots and the lm output, carat seems to have the largest effect on the price of a diamond.

ggplot(diamonds) +
  geom_boxplot(aes(x=cut, y=carat))

We see that the carat value tends to be higher for the Fair cut diamonds. This combination of high carats with Fair cuts drives prices higher for lower quality cuts.

  3. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
# use ggstance
ggplot(mpg, aes(hwy, class, fill=factor(cyl))) +
  geom_boxploth()

# Horizontal with coord_flip()
ggplot(mpg, aes(class, hwy, fill = factor(cyl))) +
  geom_boxplot() +
  coord_flip()

Both methods produce the same plot, but the syntax to create them differs: with geom_boxploth() the x and y aesthetics are swapped relative to geom_boxplot(), and no coord_flip() is needed. The ggstance syntax is a bit more intuitive, but the difference is minor.

  4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package and try to use geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
ggplot(diamonds) +
  geom_lv(aes(x=cut, y=price, fill=cut))

ggplot(diamonds) +
  geom_boxplot(aes(x=cut, y=price, fill=cut))

These letter-value plots give us more information beyond the quantiles of the boxplot. The middle box represents the same information as a boxplot, and each subsequent box that branches off the middle encompasses half of the remaining data. The boxes stretch further for the Very Good and Premium cuts, meaning that more people are paying higher prices for those cuts. When a box is short, few buyers are paying much more than the values covered by the previous box. These plots convey more about the distribution of the tails, so they make sense for larger datasets where outliers are more numerous. One misleading feature is that the width of the boxes does not represent anything. Violin plots arguably remain more intuitive than letter-value plots.

  5. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
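A sketch comparing the three options on the mpg data (my own example, not from the book):

```r
library(ggplot2)

# violin: mirrored density per category, all classes share one axis
p_violin <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin()

# facetted histogram: exact counts, but comparison is spread across panels
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ class, ncol = 1)

# coloured frequency polygon: overlaid on a common axis, but lines can
# obscure one another when there are many categories
p_freq <- ggplot(mpg, aes(x = hwy, colour = class)) +
  geom_freqpoly(binwidth = 2)
```

geom_violin() makes the shapes easy to compare side by side but hides the counts; the facetted histogram preserves counts but makes cross-panel comparison harder; the freqpoly keeps a common axis but gets cluttered as the number of categories grows.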

  6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
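A sketch of an answer, assuming the ggbeeswarm package is installed. Its two main geoms are shown below; check the package docs for the full set, which also includes matching position_*() functions:

```r
library(ggplot2)
library(ggbeeswarm)

# geom_quasirandom(): offsets points with a quasirandom jitter so the
# cloud of points traces the shape of the distribution
p_quasi <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_quasirandom()

# geom_beeswarm(): packs points side by side so that none overlap,
# producing the classic "beeswarm" outline
p_bee <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_beeswarm()
```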

7.5.2 Two categorical variables

To visualize covariation between categorical variables, you’ll need to count the number of observations for each combination. One way to do this is to use geom_count():

ggplot(data = diamonds) +
  geom_count(aes(x=cut, y=color))

The size of each circle represents the count at each categorical combination. Covariation will appear as strong correlation between specific x values and specific y values.

Another approach is to compute the count with dplyr:

diamonds %>%
  count(color, cut)
## # A tibble: 35 x 3
##    color cut           n
##    <ord> <ord>     <int>
##  1 D     Fair        163
##  2 D     Good        662
##  3 D     Very Good  1513
##  4 D     Premium    1603
##  5 D     Ideal      2834
##  6 E     Fair        224
##  7 E     Good        933
##  8 E     Very Good  2400
##  9 E     Premium    2337
## 10 E     Ideal      3903
## # ... with 25 more rows

Then visualize with geom_tile() and the fill aesthetic:

diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x=color, y=cut)) +
  geom_tile(mapping = aes(fill=n))

If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.

7.5.2.1 Exercises

  1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
diamonds %>%
  count(color, cut) %>%
  mutate(log_n = log10(n)) %>%
  ggplot(mapping = aes(x=color, y=cut)) +
  geom_tile(mapping = aes(fill=log_n))

We can take log values to get a better idea of how the counts are distributed.
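Alternatively (a sketch), we could rescale to proportions within each colour, so every column of tiles sums to one and directly shows the distribution of cut within colour:

```r
library(dplyr)
library(ggplot2)

p <- diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%  # share of each cut within a colour
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop))
p
```

Grouping by cut instead would show the distribution of colour within cut.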

  2. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
  group_by(dest, month) %>%
  filter(!is.na(dep_delay)) %>%
  summarise(delays=mean(dep_delay)) %>%
  ggplot(aes(x=month, y=dest)) +
  geom_tile(aes(fill=delays))
## `summarise()` has grouped output by 'dest'. You can override using the `.groups` argument.

There were a lot of NA values that made the plot very difficult to read. After removing them, we have a cleaner plot, though it is still crowded by the many destinations. Filtering down to the destinations we care about, or grouping them, would give a cleaner plot. The months would also make more sense as factors rather than numbers.
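Those suggested improvements might look like this (the busy-destination cutoff of 5000 flights is arbitrary):

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

p <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(dest) %>%
  filter(n() > 5000) %>%                 # keep only busy destinations (arbitrary cutoff)
  group_by(dest, month) %>%
  summarise(delays = mean(dep_delay), .groups = "drop") %>%
  ggplot(aes(x = factor(month), y = dest)) +  # months as discrete factors
  geom_tile(aes(fill = delays))
p
```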

  3. Why is it slightly better to use aes(x=color, y=cut) rather than aes(x=cut, y=color) in the example above?
ggplot(data = diamonds) +
  geom_count(aes(x=cut, y=color))

ggplot(data = diamonds) +
  geom_count(aes(x=color, y=cut))

With cut on the y-axis, its longer labels (e.g. “Very Good”) are easier to read than they would be crowded along the x-axis.

7.5.3 Two continuous variables

One great way to visualize the covariation between two continuous variables is with a scatterplot. For example, we can easily see an exponential relationship between the carat size and price of a diamond.

ggplot(diamonds) +
  geom_point(aes(x=carat, y=price))

However, scatterplots become less useful as the size of your dataset grows, because points begin to overplot and pile up into areas of uniform black (as above). You can add the alpha aesthetic to add transparency to help with this.

ggplot(diamonds) +
  geom_point(aes(x=carat, y=price), alpha = 1/100)

Using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You need to install the hexbin package to use geom_hex().

ggplot(smaller) +
  geom_bin2d(aes(x=carat, y=price))

# install.packages('hexbin')
ggplot(smaller) +
  geom_hex(aes(x=carat, y=price))

Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualising the combination of a categorical and continuous variable that you learned about. For example, you could bin carat and then, for each group, display a boxplot:

ggplot(smaller, aes(x=carat, y=price)) +
  geom_boxplot(aes(group=cut_width(carat, 0.1)))

cut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarises a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth=TRUE.

ggplot(smaller, aes(x=carat, y=price)) +
  geom_boxplot(aes(group=cut_width(carat, 0.1)), varwidth=TRUE)

Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number():

ggplot(smaller, aes(x=carat, y=price)) +
  geom_boxplot(aes(group=cut_number(carat, 20)))

7.5.3.1 Exercises

  1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?

cut_width() gives evenly sized bins, but takes no account of how many points fall within each bin. You can combine cut_width() with varwidth=TRUE to indicate how many points each bin contains. cut_number() gives bins containing the same number of points, but the bins then differ in size. If bins hold unequal numbers of points, it is more informative to convey that in the plot itself. Comparing the plots above, cut_number() offers a much more apparent reflection of how the points are distributed across regions.
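For instance, a frequency polygon of price conditioned on binned carat (a sketch; five bins chosen arbitrarily):

```r
library(ggplot2)

# cut_number() makes five bins with roughly equal numbers of diamonds,
# so each line in the polygon summarises a comparable amount of data
p <- ggplot(diamonds, aes(x = price, colour = cut_number(carat, 5))) +
  geom_freqpoly(binwidth = 500)
p
```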

  2. Visualise the distribution of carat, partitioned by price.
ggplot(smaller, aes(x=price, y=carat)) +
  geom_boxplot(aes(group=cut_number(price, 10)))

ggplot(smaller, aes(x=price, y=carat)) +
  geom_boxplot(aes(group=cut_width(price, 1000)))

  3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
# create column for size
diamonds %>%
  mutate(size=x*y*z) %>%
ggplot() +
  geom_hex(aes(x=size, y=price))

For the most part, the price increases as size increases. There are a couple of points on this plot showing diamonds with very large size and relatively low price. The price of these outlier points does not appear to depend fully on size.

  4. Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.
ggplot(diamonds) +
  geom_hex(aes(x=carat, y=price)) +
  facet_wrap(. ~ cut)

  5. Two-dimensional plots reveal outliers that are not visible in one-dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

Why is a scatterplot a better display than a binned plot for this case?

The scatterplot is a better choice here because a binned plot tends to hide the outliers in the bins. For example, in the plot below, you can still see these outliers, but they do not stand out as much as in the scatter plot.

ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = x, y = y, group=cut_number(x, 10))) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))